
    A survey on online active learning

    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has attracted considerable attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in recent decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work provides an overview of the most recently proposed approaches for selecting the most informative observations from data streams in the context of online active learning. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research. The review is intended to give a comprehensive and up-to-date picture of the field and to highlight directions for future work.
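
    To make the stream-based setting concrete, the sketch below shows the generic selection loop that such approaches share: each instance is seen once, a score decides on the spot whether its label is worth the cost, and the model is updated incrementally. The names (informativeness, oracle, budget) are illustrative assumptions, not taken from the survey.

```python
# Minimal skeleton of an online (stream-based) active learning loop.
# The query strategy sees each instance exactly once and must decide
# immediately whether to request its label; all names are illustrative.

def online_active_learning(stream, model, informativeness, oracle,
                           threshold, budget):
    """Query labels only for points scoring above `threshold`,
    until `budget` labels have been spent."""
    n_queries = 0
    for x in stream:                       # unlabeled observations arrive one by one
        if n_queries >= budget:
            break
        if informativeness(model, x) > threshold:
            y = oracle(x)                  # costly step, e.g. a quality inspection
            model.partial_fit([x], [y])    # incremental model update
            n_queries += 1
    return model
```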

    Online Active Learning for Soft Sensor Development using Semi-Supervised Autoencoders

    Data-driven soft sensors are extensively used in industrial and chemical processes to predict hard-to-measure process variables whose real value is difficult to track during routine operations. The regression models used by these sensors often require a large number of labeled examples, yet obtaining the label information can be very expensive given the high time and cost required by quality inspections. In this context, active learning methods can be highly beneficial as they can suggest the most informative labels to query. However, most of the active learning strategies proposed for regression focus on the offline setting. In this work, we adapt some of these approaches to the stream-based scenario and show how they can be used to select the most informative data points. We also demonstrate how to use a semi-supervised architecture based on orthogonal autoencoders to learn salient features in a lower dimensional space. The Tennessee Eastman Process is used to compare the predictive performance of the proposed approaches. Comment: ICML 2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World.
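
    As a rough sketch of the idea of learning a lower-dimensional representation with an orthogonality-constrained autoencoder before fitting a regressor on the few labeled points, the PyTorch snippet below adds a penalty that pushes the encoder weights towards orthogonality during unsupervised pre-training. The architecture, penalty weight, and function names are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class OrthoAE(nn.Module):
    """Toy autoencoder with an orthogonality penalty on the encoder."""
    def __init__(self, n_inputs, n_latent):
        super().__init__()
        self.encoder = nn.Linear(n_inputs, n_latent, bias=False)
        self.decoder = nn.Linear(n_latent, n_inputs, bias=False)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def pretrain_step(model, x_unlabeled, optimizer, ortho_weight=1e-2):
    """One unsupervised step: reconstruction loss plus a penalty that
    pushes the encoder rows towards an orthonormal set."""
    x_hat, _ = model(x_unlabeled)
    W = model.encoder.weight                      # shape (n_latent, n_inputs)
    eye = torch.eye(W.shape[0], device=W.device)
    ortho_penalty = ((W @ W.T - eye) ** 2).sum()
    loss = ((x_hat - x_unlabeled) ** 2).mean() + ortho_weight * ortho_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

    Downstream, a regression model would be fitted on the encoded features model.encoder(x) of the labeled observations only, which is where the semi-supervised benefit of the abundant unlabeled data comes in.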

    Stream-based active learning with linear models

    The proliferation of automated data collection schemes and the advances in sensor technology are increasing the amount of data we are able to monitor in real time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature, but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by optimal experimental design theory, and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error. Comment: Published in Knowledge-Based Systems (2022).
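
    A minimal sketch of this kind of threshold rule, assuming a linear model and a D-optimality-style score x^T (X^T X)^{-1} x as the measure of informativeness; the helper names and the rank-one update are illustrative, not the paper's code.

```python
import numpy as np

# Threshold rule inspired by optimal experimental design: a new point x is
# queried when its leverage-style score under the current design exceeds a
# fixed threshold. XtX_inv is the inverse information matrix (X^T X)^{-1}
# of the points labeled so far.

def informativeness(XtX_inv, x):
    """D-optimality-style score of a candidate point."""
    return float(x @ XtX_inv @ x)

def maybe_query(XtX_inv, x, threshold):
    """Decide on the spot whether the label of x is worth its cost."""
    return informativeness(XtX_inv, x) > threshold

def update_information(XtX_inv, x):
    """Rank-one (Sherman-Morrison) update after x is added to the design."""
    v = XtX_inv @ x
    return XtX_inv - np.outer(v, v) / (1.0 + x @ v)
```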

    Active Learning for Data Streams

    As businesses increasingly rely on machine learning models to make informed decisions, the ability to develop accurate and reliable models is critical. However, in many industrial contexts, data annotation represents a major bottleneck to the training and deployment of predictive models. This thesis focuses on data-efficient strategies for developing machine learning models in label-scarce settings. The increasing availability of unlabeled data in various applications has led to the need for efficient methods that minimize the cost associated with collecting labeled observations. Traditional active learning approaches, such as pool-based methods, have been extensively studied, but the emergence of data streams has necessitated the development of stream-based active learning strategies able to select the most informative observations from data streams in real time. The thesis begins with a survey of active learning, providing an overview of recently proposed approaches for selecting informative observations from data streams. It presents the strengths and limitations of the state of the art and discusses the challenges and opportunities that arise in this area of research. Next, the thesis presents a novel stream-based active learning strategy for linear models inspired by optimal experimental design theory. By setting a threshold on the informativeness of unlabeled data points, the proposed strategy enables the learner to decide in real time whether to label an instance or discard it. Then, the thesis investigates the robustness of online active learning in the presence of outliers and irrelevant features. The thesis also provides initial results related to an adaptive sampling scheme for drifting regression data streams. Finally, the thesis presents a stream-based active distillation framework for developing lightweight yet powerful object detection models. This approach combines active learning and knowledge distillation, allowing a compact student model to be fine-tuned using pseudo-labels generated by a large pre-trained teacher model. Overall, this thesis contributes to the field of stream-based active learning by providing insights into various techniques and addressing concerns related to robustness and scalability. The findings expand the potential applications of active learning in real-time data streams and pave the way for more efficient and effective model development.
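
    As a purely conceptual illustration of the stream-based active distillation idea mentioned above, the sketch below buffers teacher pseudo-labels for selected frames and periodically fine-tunes the compact student; teacher.predict, student.fine_tune, and select are placeholder interfaces, not the thesis's implementation.

```python
# Conceptual sketch of stream-based active distillation: a large teacher
# scores incoming frames, the selected ones are pseudo-labeled and
# buffered, and the lightweight student is periodically fine-tuned on
# that buffer. All interfaces here are placeholders.

def active_distillation(stream, teacher, student, select, buffer_size=256):
    buffer = []
    for frame in stream:
        detections = teacher.predict(frame)      # pseudo-labels from the teacher
        if select(frame, detections):            # e.g. a confidence-based rule
            buffer.append((frame, detections))
        if len(buffer) >= buffer_size:
            student.fine_tune(buffer)            # update the compact model
            buffer.clear()
    return student
```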

    What drives a donor? A machine learning‐based approach for predicting responses of nonprofit direct marketing campaigns

    Direct marketing campaigns are one of the main fundraising sources for nonprofit organizations, and their effectiveness is crucial for the sustainability of these organizations. The response rate of these campaigns is the result of the complex interaction between several factors, such as the theme of the campaign, the month in which the campaign is launched, the history of past donations from the potential donor, as well as several other variables. This work, applied to relevant data gathered from the World Wide Fund for Nature Italian marketing department, applies different data mining approaches to predict future donors and non-donors, allowing the target selection for future campaigns to be optimized and their overall costs reduced. The main challenge of this research is the presence of severely imbalanced classes, given the low percentage of responses relative to the total number of items sent. Different techniques that tackle this problem are applied, and their effectiveness in avoiding a classification biased in favor of the most populated class is highlighted. Finally, this work shows and compares the classification results obtained with the combination of sampling techniques and Decision Trees, ensemble methods, and Artificial Neural Networks. The testing approach follows a walk-forward validation procedure, which simulates a production environment and reveals the ability to accurately classify each future campaign.
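
    The walk-forward evaluation and imbalance handling described above could be set up roughly as follows; the oversampling helper, the assumption that donors (class 1) are the minority and present in every training window, and the choice of a single decision tree are simplifications for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Walk-forward evaluation: each campaign is predicted using only the
# campaigns that preceded it, and the minority (donor) class is
# oversampled before training. X, y, campaign_ids are NumPy arrays.

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority (class 1) rows until classes balance."""
    rng = np.random.default_rng(random_state)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

def walk_forward(X, y, campaign_ids):
    """Train on all past campaigns, test on the next one, then roll forward."""
    scores = []
    for c in np.unique(campaign_ids)[1:]:
        train, test = campaign_ids < c, campaign_ids == c
        X_tr, y_tr = oversample_minority(X[train], y[train])
        clf = DecisionTreeClassifier(class_weight="balanced").fit(X_tr, y_tr)
        scores.append(clf.score(X[test], y[test]))
    return scores
```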

    Hidden dimensions of the data: PCA vs autoencoders

    Principal component analysis (PCA) has been a commonly used unsupervised learning method with broad applications in both descriptive and inferential analytics. It is widely used for representation learning to extract key features from a dataset and visualize them in a lower dimensional space. With more applications of neural network-based methods, autoencoders (AEs) have gained popularity for dimensionality reduction tasks. In this paper, we explore the intriguing relationship between PCA and AEs and demonstrate, through some examples, how these two approaches yield similar results in the case of the so-called linear AEs (LAEs). This study provides insights into the evolving landscape of unsupervised learning and highlights the relevance of both PCA and AEs in modern data analysis.
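
    The PCA/linear-autoencoder relationship can be checked numerically in a few lines of NumPy. The linear autoencoder below is fitted by alternating least squares rather than gradient descent, purely to keep the sketch short and deterministic; the objective, min ||X - X W_e W_d||_F^2, is the same one a gradient-trained LAE minimizes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated data
X -= X.mean(axis=0)                                         # center, as PCA assumes
k = 3                                                       # latent dimension

# Rank-k PCA reconstruction via the SVD.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_pca = X @ Vt[:k].T @ Vt[:k]

# Linear autoencoder: encoder W_e (10 x k), decoder W_d (k x 10),
# fitted by alternating least squares on the reconstruction objective.
W_e = rng.normal(size=(10, k))
for _ in range(200):
    W_d = np.linalg.pinv(X @ W_e) @ X    # best decoder for the current encoder
    W_e = np.linalg.pinv(W_d)            # best encoder for the current decoder
W_d = np.linalg.pinv(X @ W_e) @ X        # final matched decoder
X_lae = X @ W_e @ W_d

# The two reconstruction errors should essentially coincide: the optimal
# LAE spans the same top-k principal subspace as PCA, up to an invertible
# change of basis in the latent space.
print(np.linalg.norm(X - X_pca), np.linalg.norm(X - X_lae))
```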

    Robust online active learning

    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.
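
    As an illustration of combining an informativeness threshold with protection against outliers, the sketch below screens candidates with a robust median/MAD z-score before applying the D-optimality-style rule from the earlier sketch. This is a simplified stand-in, not the paper's conditional D-optimal algorithm or its robust estimator.

```python
import numpy as np

# Outlier-aware variant of a threshold query rule: candidates are first
# screened with a robust per-feature z-score built from the median and MAD
# of the points labeled so far, and only points inside that bounded region
# compete on the informativeness score.

def robust_zscore(x, X_labeled):
    med = np.median(X_labeled, axis=0)
    mad = np.median(np.abs(X_labeled - med), axis=0) + 1e-12
    return np.abs(x - med) / (1.4826 * mad)      # roughly std units under normality

def query_decision(x, X_labeled, XtX_inv, info_threshold, z_cut=3.5):
    """Query only points that are informative but not flagged as outliers."""
    if np.any(robust_zscore(x, X_labeled) > z_cut):
        return False                             # outside the bounded search area
    return float(x @ XtX_inv @ x) > info_threshold
```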
